14 research outputs found

    An Intrinsically-Motivated Approach for Learning Highly Exploring and Fast Mixing Policies

    Full text link
    What is a good exploration strategy for an agent that interacts with an environment in the absence of external rewards? Ideally, we would like a policy that drives towards a uniform state-action visitation (highly exploring) in a minimum number of steps (fast mixing), in order to ease efficient learning of any goal-conditioned policy later on. Unfortunately, it is remarkably arduous to directly learn an optimal policy of this nature. In this paper, we propose a novel surrogate objective for learning highly exploring and fast mixing policies, which focuses on maximizing a lower bound to the entropy of the steady-state distribution induced by the policy. In particular, we introduce three novel lower bounds, each leading to a distinct optimization problem, which trade off theoretical guarantees against computational complexity. Then, we present a model-based reinforcement learning algorithm, IDE³AL, to learn an optimal policy according to the introduced objective. Finally, we provide an empirical evaluation of this algorithm on a set of hard-exploration tasks. Comment: In the 34th AAAI Conference on Artificial Intelligence (AAAI 2020).
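    For orientation, the objective described above can be summarized as maximizing the entropy of the steady-state state-action distribution induced by the policy; the notation below is an illustrative paraphrase, not taken verbatim from the paper:

        \max_{\pi \in \Pi} \; H(d_{\pi}),
        \qquad
        d_{\pi}(s, a) = \lim_{t \to \infty} \Pr(s_t = s, a_t = a \mid \pi),

    where H denotes the Shannon entropy. Since optimizing H(d_{\pi}) directly is hard, the paper maximizes tractable lower bounds on this quantity instead, trading off guarantees against computational cost.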

    A Policy Gradient Method for Task-Agnostic Exploration

    Get PDF
    In a reward-free environment, what is a suitable intrinsic objective for an agent to pursue so that it can learn an optimal task-agnostic exploration policy? In this paper, we argue that the entropy of the state distribution induced by limited-horizon trajectories is a sensible target. In particular, we present a novel and practical policy-search algorithm, Maximum Entropy POLicy optimization (MEPOL), to learn a policy that maximizes a non-parametric, k-nearest neighbors estimate of the state distribution entropy. In contrast to known methods, MEPOL is completely model-free, as it requires neither estimating the state distribution of any policy nor modeling the transition dynamics. Then, we empirically show that MEPOL allows learning a maximum-entropy exploration policy in high-dimensional, continuous-control domains, and how this policy facilitates learning a variety of meaningful reward-based tasks downstream.
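    As a rough illustration of the kind of quantity MEPOL maximizes, the Python sketch below computes a non-parametric k-nearest-neighbors (Kozachenko-Leonenko) entropy estimate from a batch of visited states. It is a generic estimator written for this summary, not the exact estimator or the importance-weighted trust-region machinery used in the paper:

        import numpy as np
        from scipy.spatial import cKDTree
        from scipy.special import digamma, gammaln


        def knn_entropy(states: np.ndarray, k: int = 5) -> float:
            """Kozachenko-Leonenko k-NN estimate of differential entropy.

            states: (n, d) array of sampled states; k: number of neighbors.
            """
            n, d = states.shape
            tree = cKDTree(states)
            # Distance to the k-th neighbor (the query returns the point itself first).
            dists, _ = tree.query(states, k=k + 1)
            eps = dists[:, -1]
            # Log-volume of the d-dimensional unit ball.
            log_ball = (d / 2.0) * np.log(np.pi) - gammaln(d / 2.0 + 1.0)
            return (digamma(n) - digamma(k) + log_ball
                    + d * np.mean(np.log(eps + 1e-12)))


        # Usage sketch: estimate the entropy of states visited under the current policy.
        if __name__ == "__main__":
            rng = np.random.default_rng(0)
            rollout_states = rng.normal(size=(1000, 2))  # placeholder trajectory data
            print(knn_entropy(rollout_states, k=5))

    In a MEPOL-style loop, an estimate of this form (computed on states sampled by the current policy) would serve as the objective driving the policy update.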

    Unsupervised reinforcement learning via state entropy maximization

    Get PDF
    Reinforcement Learning (RL) provides a powerful framework to address sequential decision-making problems in which the transition dynamics are unknown or too complex to be represented. The RL approach is based on speculating what is the best decision to make given sample estimates obtained from previous interactions, a recipe that led to several breakthroughs in various domains, ranging from game playing to robotics. Despite their success, current RL methods hardly generalize from one task to another, and achieving the kind of generalization obtained through unsupervised pre-training in non-sequential problems seems unthinkable. Unsupervised RL has recently emerged as a way to improve the generalization of RL methods. Just like its non-sequential counterpart, the unsupervised RL framework comprises two phases: an unsupervised pre-training phase, in which the agent interacts with the environment without external feedback, and a supervised fine-tuning phase, in which the agent aims to efficiently solve a task in the same environment by exploiting the knowledge acquired during pre-training. In this thesis, we study unsupervised RL via state entropy maximization, in which the agent makes use of the unsupervised interactions to pre-train a policy that maximizes the entropy of its induced state distribution. First, we provide a theoretical characterization of the learning problem by considering a convex RL formulation that subsumes state entropy maximization. Our analysis shows that maximizing the state entropy in finite trials is inherently harder than standard RL. Then, we study the state entropy maximization problem from an optimization perspective. In particular, we show that the primal formulation of the corresponding optimization problem can be (approximately) addressed through tractable linear programs. Finally, we provide the first practical methodologies for state entropy maximization in complex domains, both when the pre-training takes place in a single environment and when it spans multiple environments.
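    For reference, the convex RL formulation mentioned above can be sketched as optimizing a concave functional over the set of state distributions achievable in the environment; the notation is an illustrative paraphrase of the standard formulation, not taken verbatim from the thesis:

        \max_{d \in \mathcal{K}} \; \mathcal{F}(d),
        \qquad
        \mathcal{K} = \{ d_{\pi} : \pi \in \Pi \},

    where choosing \mathcal{F}(d) = H(d) = -\sum_{s} d(s) \log d(s) recovers state entropy maximization, while a linear \mathcal{F}(d) = \sum_{s} d(s) r(s) recovers the standard RL objective.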

    A Tale of Sampling and Estimation in Discounted Reinforcement Learning

    Full text link
    The most relevant problems in discounted reinforcement learning involve estimating the mean of a function under the stationary distribution of a Markov reward process, such as the expected return in policy evaluation, or the policy gradient in policy optimization. In practice, these estimates are produced through finite-horizon episodic sampling, which neglects the mixing properties of the Markov process. It is mostly unclear how this mismatch between the practical and the ideal setting affects the estimation, and the literature lacks a formal study of the pitfalls of episodic sampling and of how to do it optimally. In this paper, we present a minimax lower bound on the discounted mean estimation problem that explicitly connects the estimation error with the mixing properties of the Markov process and the discount factor. Then, we provide a statistical analysis of a set of notable estimators and the corresponding sampling procedures, which includes the finite-horizon estimators often used in practice. Crucially, we show that estimating the mean by directly sampling from the discounted kernel of the Markov process brings compelling statistical properties w.r.t. the alternative estimators, as it matches the lower bound without requiring careful tuning of the episode horizon. Comment: AISTATS 202
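    To make the sampling distinction concrete, here is a minimal Python sketch, my own illustration rather than code from the paper, assuming a generic environment with reset() and step(action) returning (next_state, reward, done) and an arbitrary function f of the state. It contrasts a truncated finite-horizon estimate with evaluating f at a state reached after a Geometric(1 - gamma) number of steps, i.e., at a sample from the discounted kernel:

        import numpy as np


        def finite_horizon_estimate(env, policy, f, gamma, horizon, episodes):
            """Discount-weighted average of f over truncated episodes
            (the kind of estimator commonly used in practice)."""
            total, weight = 0.0, 0.0
            for _ in range(episodes):
                s = env.reset()
                for t in range(horizon):
                    total += (gamma ** t) * f(s)
                    weight += gamma ** t
                    s, _, done = env.step(policy(s))
                    if done:
                        break
            return total / weight


        def discounted_kernel_estimate(env, policy, f, gamma, episodes, seed=0):
            """Evaluate f at a state reached after a geometric number of steps,
            i.e., at a sample from the discounted state distribution."""
            rng = np.random.default_rng(seed)
            values = []
            for _ in range(episodes):
                s = env.reset()
                steps = rng.geometric(1.0 - gamma) - 1  # support {0, 1, 2, ...}
                for _ in range(steps):
                    s, _, done = env.step(policy(s))
                    if done:
                        break
                values.append(f(s))
            return float(np.mean(values))

    The second estimator needs no horizon hyperparameter: the geometric stopping time plays the role of the discount directly, which is the intuition behind its favorable statistical properties.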

    A Tale of Sampling and Estimation in Discounted Reinforcement Learning

    Get PDF
    The most relevant problems in discounted reinforcement learning involve estimating the mean of a function under the stationary distribution of a Markov reward process, such as the expected return in policy evaluation, or the policy gradient in policy optimization. In practice, these estimates are produced through finite-horizon episodic sampling, which neglects the mixing properties of the Markov process. It is mostly unclear how this mismatch between the practical and the ideal setting affects the estimation, and the literature lacks a formal study of the pitfalls of episodic sampling and of how to do it optimally. In this paper, we present a minimax lower bound on the discounted mean estimation problem that explicitly connects the estimation error with the mixing properties of the Markov process and the discount factor. Then, we provide a statistical analysis of a set of notable estimators and the corresponding sampling procedures, which includes the finite-horizon estimators often used in practice. Crucially, we show that estimating the mean by directly sampling from the discounted kernel of the Markov process brings compelling statistical properties w.r.t. the alternative estimators, as it matches the lower bound without requiring careful tuning of the episode horizon.

    Configurable Markov Decision Processes

    Get PDF
    In many real-world problems, there is the possibility to configure, to a limited extent, some environmental parameters to improve the performance of a learning agent. In this paper, we propose a novel framework, Configurable Markov Decision Processes (Conf-MDPs), to model this new type of interaction with the environment. Furthermore, we provide a new learning algorithm, Safe Policy-Model Iteration (SPMI), to jointly and adaptively optimize the policy and the environment configuration. After introducing our approach and deriving some theoretical results, we present an experimental evaluation on two illustrative problems to show the benefits of environment configurability on the performance of the learned policy.
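    As a loose illustration of the "jointly optimize the policy and the environment configuration" idea, here is a hypothetical Python sketch on a tabular Conf-MDP where the configuration is a mixture over a finite set of candidate transition models. It only conveys the alternating-update structure; the safe, adaptively chosen update coefficients that define SPMI are derived in the paper and are not reproduced here:

        import numpy as np


        def policy_evaluation(P, R, pi, gamma):
            """Exact policy evaluation on a tabular MDP: V = (I - gamma * P_pi)^-1 R_pi."""
            n_s = P.shape[0]
            P_pi = np.einsum("sa,sap->sp", pi, P)   # state-to-state kernel under pi
            R_pi = np.einsum("sa,sa->s", pi, R)     # expected reward under pi
            return np.linalg.solve(np.eye(n_s) - gamma * P_pi, R_pi)


        def alternating_configuration_sketch(P_candidates, R, pi, gamma,
                                             alpha=0.1, beta=0.1, iters=200):
            """Hypothetical alternating improvement of policy and configuration.

            P_candidates: list of admissible transition models P[s, a, s'];
            the configuration is a convex mixture over them. Fixed step sizes
            alpha/beta stand in for SPMI's safe, adaptive coefficients.
            """
            w = np.ones(len(P_candidates)) / len(P_candidates)   # configuration weights
            for _ in range(iters):
                P = np.tensordot(w, np.stack(P_candidates), axes=1)
                V = policy_evaluation(P, R, pi, gamma)
                Q = R + gamma * np.einsum("sap,p->sa", P, V)
                # Conservative policy update toward the greedy policy.
                pi_greedy = np.eye(pi.shape[1])[Q.argmax(axis=1)]
                pi = (1 - alpha) * pi + alpha * pi_greedy
                # Conservative configuration update toward the best single candidate.
                scores = [policy_evaluation(Pc, R, pi, gamma).mean() for Pc in P_candidates]
                w_greedy = np.eye(len(P_candidates))[int(np.argmax(scores))]
                w = (1 - beta) * w + beta * w_greedy
            return pi, w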

    An Intrinsically-Motivated Approach for Learning Highly Exploring and Fast Mixing Policies

    No full text
    What is a good exploration strategy for an agent that interacts with an environment in the absence of external rewards? Ideally, we would like a policy that drives towards a uniform state-action visitation (highly exploring) in a minimum number of steps (fast mixing), in order to ease efficient learning of any goal-conditioned policy later on. Unfortunately, it is remarkably arduous to directly learn an optimal policy of this nature. In this paper, we propose a novel surrogate objective for learning highly exploring and fast mixing policies, which focuses on maximizing a lower bound to the entropy of the steady-state distribution induced by the policy. In particular, we introduce three novel lower bounds, each leading to a distinct optimization problem, which trade off theoretical guarantees against computational complexity. Then, we present a model-based reinforcement learning algorithm, IDE³AL, to learn an optimal policy according to the introduced objective. Finally, we provide an empirical evaluation of this algorithm on a set of hard-exploration tasks.

    Unsupervised Reinforcement Learning in Multiple Environments

    No full text
    Several recent works have been dedicated to unsupervised reinforcement learning in a single environment, in which a policy is first pre-trained with unsupervised interactions, and then fine-tuned towards the optimal policy for several downstream supervised tasks defined over the same environment. Along this line, we address the problem of unsupervised reinforcement learning in a class of multiple environments, in which the policy is pre-trained with interactions from the whole class, and then fine-tuned for several tasks in any environment of the class. Notably, the problem is inherently multi-objective, as we can trade off the pre-training objective between environments in many ways. In this work, we foster an exploration strategy that is sensitive to the most adverse cases within the class. Hence, we cast the exploration problem as the maximization of the mean of a critical percentile of the state visitation entropy induced by the exploration strategy over the class of environments. Then, we present a policy gradient algorithm, αMEPOL, to optimize the introduced objective through mediated interactions with the class. Finally, we empirically demonstrate the ability of the algorithm to learn to explore challenging classes of continuous environments, and we show that reinforcement learning greatly benefits from the pre-trained exploration strategy w.r.t. learning from scratch.
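    As a small illustration of the percentile-sensitive objective, the Python sketch below averages the lowest alpha-fraction of per-environment entropy estimates (a CVaR-style aggregation over the class). The names entropy_of and sampled_envs in the usage comment are hypothetical placeholders, and the actual αMEPOL gradient estimator is not reproduced here:

        import numpy as np


        def percentile_objective(entropies, alpha=0.2):
            """Mean of the lowest alpha-fraction of per-environment entropy values,
            so that improving the objective requires improving the worst cases.

            entropies: per-environment estimates of the state-visitation entropy
            induced by the current exploration policy.
            """
            entropies = np.sort(np.asarray(entropies, dtype=float))
            k = max(1, int(np.ceil(alpha * len(entropies))))
            return float(entropies[:k].mean())


        # Usage sketch (hypothetical helpers): estimate the objective from a batch of
        # environments sampled from the class, using any state-entropy estimator.
        # objective = percentile_objective(
        #     [entropy_of(env, policy) for env in sampled_envs], alpha=0.2)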